feat(server): support KV Flash with same-backend target layer split #391

Open
weicj wants to merge 2 commits into Luce-Org:main from weicj:feat-kvflash-same-backend-layer-split

@weicj weicj commented Jun 16, 2026

Collaborator

Summary

This PR extends KV Flash bounded KV residency to same-backend target layer split, so split targets can use a pool-sized KV cache instead of allocating full-context KV on every shard. The same-backend path is wired for Qwen35, Gemma4, and Laguna.

Changes

  • Add KV Flash pool allocation and pager lifecycle support to Qwen35, Gemma4, and Laguna same-backend target layer split.
  • Thread KvFlashPager through same-backend layer-split prefill/decode paths, including Qwen35 DFlash verify.
  • Build layer-split attention graphs in pool-slot mode when KV Flash is enabled, including set-rows KV writes and slot-space attention masks.
  • Align Gemma4 split boundaries so KV-sharing layers stay on the same shard as their source layer.
  • Guard Laguna KV Flash mask uploads so only tensors allocated for the current graph are written.
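
To illustrate what a "slot-space attention mask" could look like, here is a minimal sketch. It is not taken from this PR: `build_slot_mask`, `slot_pos`, `q_pos`, and `window` are hypothetical names, and the layout (one row per query, one column per pool slot) is an assumption. The idea is that once KV rows live in pool slots rather than at their absolute positions, the mask has to be indexed by slot and consult a slot-to-position table to decide visibility.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (assumption, not the PR's code): builds an additive
// attention mask over pool slots instead of absolute token positions.
//   slot_pos[s] : token position currently held in slot s, or -1 if empty
//   q_pos[i]    : absolute position of query row i
//   window      : <= 0 for plain causal attention; otherwise a key at
//                 position p is visible to query q only if q - p < window
// Returns a row-major [q_pos.size() x slot_pos.size()] mask where 0 means
// "attend" and -inf means "masked".
std::vector<float> build_slot_mask(const std::vector<int32_t>& slot_pos,
                                   const std::vector<int32_t>& q_pos,
                                   int32_t window) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    const std::size_t n_slots = slot_pos.size();
    std::vector<float> mask(q_pos.size() * n_slots, neg_inf);
    for (std::size_t i = 0; i < q_pos.size(); ++i) {
        for (std::size_t s = 0; s < n_slots; ++s) {
            const int32_t p = slot_pos[s];
            if (p < 0) continue;                                 // empty slot
            if (p > q_pos[i]) continue;                          // future token
            if (window > 0 && q_pos[i] - p >= window) continue;  // outside SWA window
            mask[i * n_slots + s] = 0.0f;
        }
    }
    return mask;
}
```

For example, with `slot_pos = {4, -1, 2, 5}` and a single query at position 4, slots 0 and 2 are visible while the empty slot 1 and the future token in slot 3 stay masked. A real graph builder would presumably upload such a mask into a backend tensor; that step is omitted here.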

Notes

  • A local same-backend runtime smoke test passed on dual Pro VII HIP with Qwen3.6-27B Q3, --target-devices hip:0,hip:1 --target-layer-split 1,1, and DFLASH_KVFLASH=1024; the request returned a valid OpenAI-compatible response.
  • A local same-backend runtime smoke test passed on dual Pro VII HIP with Gemma4 E2B Q4, --target-devices hip:0,hip:1 --target-layer-split 1,1, and DFLASH_KVFLASH=512; the server auto-aligned the shards to [0,14)+[14,35) and returned a valid OpenAI-compatible response.
  • A local same-backend runtime smoke test passed on dual Pro VII HIP with Laguna-XS.2 Q4, --target-devices hip:0,hip:1 --target-layer-split 1,1, and DFLASH_KVFLASH=512; the pool was raised to 768 tokens to meet the SWA tail requirement, and the request returned a valid OpenAI-compatible response.
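
The shard auto-alignment mentioned in the Gemma4 note can be pictured with a small sketch. This is one possible rule under stated assumptions, not the server's actual logic: `align_boundary` and `kv_source` are hypothetical names, and the policy of moving the boundary earlier (rather than later) is an assumption. The sketch does not claim to reproduce the [0,14)+[14,35) result reported above.

```cpp
#include <vector>

// Hypothetical sketch (assumption, not the PR's code).
//   kv_source[i] == i      : layer i owns its KV cache
//   kv_source[i] == j < i  : layer i reuses the KV written by layer j
// Moves a proposed split boundary earlier until no layer at or after the
// boundary reads KV from a layer before it, so every KV-sharing layer ends
// up on the same shard as its source layer. A result of 0 means the first
// shard would be empty, which a caller would need to handle.
int align_boundary(const std::vector<int>& kv_source, int boundary) {
    const int n_layers = static_cast<int>(kv_source.size());
    bool moved = true;
    while (moved) {
        moved = false;
        for (int i = boundary; i < n_layers; ++i) {
            if (kv_source[i] < boundary) {
                boundary = kv_source[i];  // pull the source onto the later shard
                moved = true;
            }
        }
    }
    return boundary;
}
```

With eight layers where layers 5 to 7 reuse layer 3's KV (`kv_source = {0, 1, 2, 3, 4, 3, 3, 3}`), a proposed boundary of 4 would be pulled back to 3, yielding shards [0,3)+[3,8).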

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

3 issues found across 20 files

Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp
Comment thread server/src/gemma4/gemma4_layer_split_adapter.cpp Outdated
Comment thread server/src/laguna/laguna_layer_split_adapter.cpp Outdated